AnnoPRO: a strategy for protein function annotation based on multi-scale protein representation and a hybrid deep learning of dual-path encoding

Protein function annotation has been one of the longstanding issues in biological sciences, and various computational methods have been developed. However, the existing methods suffer from a serious long-tail problem, with a large number of GO families containing few annotated proteins. Herein, an innovative strategy named AnnoPRO was therefore constructed by enabling sequence-based multi-scale protein representation, dual-path protein encoding using pre-training, and function annotation by long short-term memory-based decoding. A variety of case studies based on different benchmarks were conducted, which confirmed the superior performance of AnnoPRO among available methods. Source code and models have been made freely available at: https://github.com/idrblab/AnnoPRO and https://zenodo.org/records/10012272.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-024-03166-1.

ProMAP had only one channel (single-channel map), and AnnoPRO whose ProMAP was shuffled (shuffled map). The results showed that every change in the model algorithm led to worse results in all Gene Ontology (GO) classes (BP, CC, MF), especially the removal of the LSTM (M3). In other words, AnnoPRO was the best model we had built so far. BP: biological process; CC: cellular component; MF: molecular function; Fmax: protein-centric maximum F-measure; AUPRC: area under the precision-recall curve.

were hierarchically connected to them. In this study, the level of the root nodes was defined as 'Level 1' (blue). The child families directly connected to the root nodes were labeled as 'Level 2' (pink).
Then, the families of 'Level 3' were defined as those child families directly connected to 'Level 2'. The following levels can thus be deduced in a similar manner. Based on our comprehensive evaluation of all GO data, the bottom level of GO's hierarchical multi-label structure was 'Level 11' (blue), which had no child families and comprised the smallest number of proteins compared with the families in the other levels (Level 1 to Level 10).

The performances were represented using AUC values in predicting the experimentally validated new protein functions that were not included in CAFA4 data, and the performances of AnnoPRO with 5-, 7-, 9-, 11-, and 13-time-step input data are highlighted in green, blue, yellow, red, and orange, respectively. AnnoPRO achieved the best prediction performance when using 13-time-step input data. (b) The figure also displays the impact of different numbers of LSTM layers on protein function prediction for AnnoPRO. The performances of AnnoPRO with 1-, 2-, 3-, and 4-layer LSTM are highlighted in green, yellow, red, and orange, respectively. AnnoPRO achieved the highest prediction performance when using a 3-layer LSTM configuration. These results highlight the influence of hyperparameter choices on the performance of the AnnoPRO model, and suggest that using 13-time-step input data and a 3-layer LSTM yields the best performance for protein function annotation.
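The level definition given for the GO hierarchy (root families at 'Level 1', their direct children at 'Level 2', and so on) can be illustrated with a breadth-first traversal. The following is only a minimal sketch under stated assumptions: the function name `assign_levels`, the toy family identifiers, and the rule that a family reachable through several parents takes the shallowest level are all illustrative choices, not details confirmed by this study.

```python
from collections import deque


def assign_levels(children, roots):
    """Label each family with its level, counting root families as Level 1.

    children: dict mapping a family to the list of its direct child families.
    Assumption: a family with several parents keeps the first (shallowest)
    level at which the breadth-first traversal reaches it.
    """
    levels = {root: 1 for root in roots}
    queue = deque(roots)
    while queue:
        parent = queue.popleft()
        for child in children.get(parent, ()):
            if child not in levels:  # not yet reached by a shorter path
                levels[child] = levels[parent] + 1
                queue.append(child)
    return levels


# Hypothetical toy hierarchy with a single root family "BP".
toy_children = {
    "BP": ["f1", "f2"],
    "f1": ["f3"],
    "f2": ["f3", "f4"],
    "f3": ["f5"],
}
toy_levels = assign_levels(toy_children, ["BP"])
```

In this toy hierarchy, `f5` has no children and sits at the deepest level, which mirrors the role that 'Level 11' plays in the full GO structure described above.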
Table S1. AUC values at nine levels (Level 1 to Level 9) used to evaluate AnnoPRO and three representative methods (DeepGOPlus, NetGO3 and PFmulDL). The values indicating the best performance among all methods were highlighted in bold, and AnnoPRO performed the best in the vast majority (8/9) of the Gene Ontology classes (BP, CC, MF) under AUC. AnnoPRO was found to significantly improve the annotation performance for the families in the 'Tail Label Levels' without sacrificing that of the 'Head Label Levels', which is expected to contribute to solving the long-standing 'long-tail problem' in functional annotation.

were able to provide the training code, allowing us to ensure transparency and reproducibility in our experiments. To conduct a fair evaluation, we followed the exact training and testing procedures described in the Materials and Methods. This involved using the same datasets that were used in the original papers and code repositories of each method. By adhering to these standardized procedures, we aimed to maintain consistency and facilitate a direct comparison of the performance of these methods. For each of the aforementioned methods, we retrained the models from scratch using our training code, following the specific methodologies outlined in their respective papers and code repositories. This ensured that our retrained models were consistent with the original implementations, allowing us to assess their performance accurately. However, for NetGO2 and NetGO3, we did not have access to the training code and were therefore unable to retrain these models. Instead, we evaluated the existing models provided by the original authors using the same testing dataset. Although we could not retrain these models ourselves, we maintained a consistent evaluation framework to compare their performance fairly with that of the other ML-based methods.

By including these state-of-the-art ML-based methods and conducting the experiments in a standardized manner, we aimed to provide a comprehensive and robust evaluation of different approaches for protein function prediction.

Method S2. The Process of Feature Reset and Its Detailed Methodology
The process of "feature reset" based on the feature distance matrix (FDM) consisted of two key steps: 'dimensionality reduction' (applying UMAP or PCA to reduce the dimensionality of each feature from 1,484D to 2D) and 'coordinate allocation' (applying the Jonker-Volgenant (J-V) algorithm to allocate all 1,484 features to distinct coordinates in a 39×39 map, named the 'template map').
First, based on the feature distance matrix (FDM) (1,484×1,484), all the protein features were projected onto a 2-dimensional space as scatter points (1,484×2). Taking one feature $F_i$ as an example, it would originally be represented as a 1,484D vector ($\mathbf{d}_i$):

$$\mathbf{d}_i = (d_{i,1}, d_{i,2}, \ldots, d_{i,1484})$$

where $d_{i,j}$ indicated the pair-wise distance between features $F_i$ and $F_j$. Second, the feature vector $\mathbf{d}_i$ was mapped into a 2D vector ($\mathbf{p}_i$) that is easy to understand and present, by calculating the interrelationships among features on the manifold surface using UMAP:

$$\mathbf{p}_i = (x_i, y_i)$$

where $x_i$ and $y_i$ denoted the coordinate values of the feature $F_i$ in a two-dimensional plane, as shown in Figure 3a. Third, in order to allocate these 2D vectors of protein features ($\mathbf{p}_i$) into the Template Map, a 39×39 map ($M$) was defined to store the allocation results of the protein features. Taking the feature $F_i$ as an example, it would be represented as a grid ($\mathbf{g}_i$) in this Template Map:

$$\mathbf{g}_i = (a_i, b_i)$$

where $a_i$ and $b_i$ were integers from 0 to 38 indicating the coordinates of the feature $F_i$.
Finally, the grid locations of these features were allocated by minimizing the total cost between $\mathbf{p}_i$ and $\mathbf{g}_i$ using the Jonker-Volgenant (J-V) algorithm:

$$\min \sum_{i=1}^{1484} \mathrm{cost}(\mathbf{p}_i, \mathbf{g}_i), \quad \mathbf{g}_i \neq \mathbf{g}_j \ \text{for} \ i \neq j$$

As a result, the 1,484 protein features were transformed from an 'unordered' vector to an 'ordered' image-like representation.
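The two steps of feature reset can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in this study: the function names (`embed_2d`, `allocate_to_grid`), the toy data, and the squared Euclidean cost between rescaled scatter points and grid cells are hypothetical choices, and PCA (computed via SVD) stands in for the dimensionality reduction step so that the sketch needs only NumPy and SciPy. The assignment itself uses `scipy.optimize.linear_sum_assignment`, which SciPy documents as a modified Jonker-Volgenant algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def embed_2d(fdm):
    """Reduce each feature's distance profile (one FDM row) to 2D with PCA."""
    centered = fdm - fdm.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]  # scores on the first two principal components


def allocate_to_grid(points, side):
    """Assign every 2D point to a distinct cell of a side x side grid."""
    n = points.shape[0]
    if n > side * side:
        raise ValueError("grid has fewer cells than features")
    # Rescale the scatter points so that they span the grid coordinates.
    low = points.min(axis=0)
    span = points.max(axis=0) - low
    span[span == 0] = 1.0
    scaled = (points - low) / span * (side - 1)
    # Enumerate all integer grid cells (row, column).
    rows, cols = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
    cells = np.column_stack([rows.ravel(), cols.ravel()])
    # Assumed cost: squared Euclidean distance between point and cell.
    cost = cdist(scaled, cells, metric="sqeuclidean")
    feature_idx, cell_idx = linear_sum_assignment(cost)
    grid = np.empty((n, 2), dtype=int)
    grid[feature_idx] = cells[cell_idx]
    return grid


# Toy example: 20 hypothetical features placed on a 5 x 5 template map.
rng = np.random.default_rng(0)
toy_profiles = rng.normal(size=(20, 50))
toy_fdm = cdist(toy_profiles, toy_profiles, metric="correlation")
toy_grid = allocate_to_grid(embed_2d(toy_fdm), side=5)

# Fill the map for one protein: each feature value goes to its own cell.
toy_values = rng.random(20)
toy_map = np.zeros((5, 5))
toy_map[toy_grid[:, 0], toy_grid[:, 1]] = toy_values
```

With the dimensions used in this study, the same call would take 1,484 points and `side=39`; since 39×39 = 1,521, 37 cells of the template map would remain unassigned.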

Fig S1. Results of the ablation experiment. The performances were represented using the delta ratio of the evaluation criteria (Fmax and AUPRC) in predicting the experimentally validated new protein functions that were not included in CAFA4 data, and the performances of Fmax and AUPRC were highlighted with a blue plus sign and an orange circle, respectively. Six comparison models were constructed and evaluated: AnnoPRO without ProSIM (No ProSIM), AnnoPRO without ProMAP (No ProMAP), AnnoPRO without LSTM (No LSTM), AnnoPRO with the 1,484 unordered protein features directly input into the model (No map), AnnoPRO whose

Fig S2. Comparison of the performances of AnnoPRO using different dimensionality reduction methods (PCA and UMAP). The performances were represented using Fmax and AUPRC values, and the performances of AnnoPRO_PCA and AnnoPRO_UMAP were highlighted in light blue and light red, respectively. The performances of these two models were roughly the same across the three GO classes (BP, CC, MF). In particular, AnnoPRO_UMAP showed a slightly better predictive performance than AnnoPRO_PCA (0.6~1.9% for Fmax; 1.4~2.1% for AUPRC). BP: biological process; CC: cellular component; MF: molecular function; Fmax: protein-centric maximum F-measure; AUPRC: area under the precision-recall curve.
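For readers unfamiliar with the protein-centric maximum F-measure, the sketch below follows the definition commonly used in the CAFA challenges: at each score threshold, precision is averaged over the proteins with at least one predicted term, recall is averaged over all proteins with at least one true term, and Fmax is the largest resulting F-measure. It is assumed, not confirmed by this document, that the Fmax reported here follows the same definition; the function name `fmax`, the threshold grid, and the toy arrays are illustrative.

```python
import numpy as np


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric maximum F-measure (CAFA-style definition).

    y_true:  (proteins x terms) binary matrix of true annotations.
    y_score: (proteins x terms) matrix of predicted scores in [0, 1].
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)
    n_true = y_true.sum(axis=1)
    has_truth = n_true > 0
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        n_pred = pred.sum(axis=1)
        tp = (pred & y_true).sum(axis=1)
        covered = n_pred > 0  # proteins with at least one prediction at t
        if not covered.any() or not has_truth.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp[has_truth] / n_true[has_truth]).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: two proteins, two GO terms (hypothetical scores).
toy_true = [[1, 0], [0, 1]]
toy_score = [[0.2, 0.8], [0.3, 0.7]]
toy_fmax = fmax(toy_true, toy_score)
```

In the toy example the best trade-off occurs at low thresholds, where every term is predicted for both proteins (precision 0.5, recall 1.0), so `toy_fmax` equals 2/3.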

Fig S3. Schematic illustration of the hierarchical multi-label structure of GO families (labeled by f_i). Three root families were provided at the top of the structure, which included biological process (BP), molecular function (MF), and cellular component (CC), and the remaining families

Fig S4. Performance assessment of four methods using Heat shock 70 kDa protein 1A (HSPA1A). The functional annotations predicted by the four studied methods are shown. If a GO family was successfully predicted by a method, a colored circle is used to indicate the prediction result. In particular, a successful prediction made by AnnoPRO, NetGO3, PFmulDL or DeepGOPlus was indicated by a circle of light red, orange, light blue or light green, respectively. Compared with HSPA2, another heat shock protein (represented in Fig. S7), the unique GO annotation is marked by a red block. As shown, AnnoPRO successfully predicted most GO families for HSPA1A.

Fig S5. Performance assessment of four methods using Heat shock 70 kDa protein 2 (HSPA2). The functional annotations predicted by the four studied methods are shown. If a GO family was successfully predicted by a method, a colored circle is used to indicate the prediction result. In particular, a successful prediction made by AnnoPRO, NetGO3, PFmulDL or DeepGOPlus was indicated by a circle of light red, orange, light

Fig S6. A comparison of model performance using different hyperparameters. (a) The graph demonstrates the effect of varying the time step on protein function prediction for the model.

Table S2. Seven classes of protein descriptors generated using PROFEAT and covered by AnnoPRO. These classes of descriptors could be further divided into a variety of descriptor types, and a total of 1,484 descriptors were finally generated in this study. The total number of protein descriptors under each descriptor type was shown. The definition of each descriptor can also be found at https://github.com/idrblab/AnnoPRO.

Table S3. The hyperparameters considered in this study. The names of the hyperparameters and the optimized settings applied in AnnoPRO were explicitly described.

The Processes of Existing Methods for Model Construction
Herein, we included several state-of-the-art ML-based methods for protein function prediction, namely DeepGO, DeepGOCNN, DeepGOPlus, TALE, and PFmulDL. For these methods, we